A reusable corpus needs syntactic annotations: Prague Dependency Treebank
Abstract
The Prague Dependency Treebank (PDT), an annotated part of the Czech National Corpus, is conceived as a three-layer system of tags. The individual layers can be characterized as follows: (i) morphemic tagging, capturing relatively disambiguated values of morphemic categories on the basis of a full morphemic analysis of Czech; (ii) syntactic tags at the so-called analytical level, capturing the functions of individual word forms; in the analytical tree structures (ATSs), every word token and punctuation mark has a corresponding node and is analyzed for its POS and morphemic value, as well as for its main syntactic function ('analytical functor', 'afun'); among the afuns, Subj, Obj and Adv are not subclassified any further; (iii) syntactic tags at the tectogrammatical level, in tectogrammatical tree structures (TGTSs), rendering the underlying (tectogrammatical) structure of the sentence, i.e., its syntactic structure proper (with a detailed classification of underlying syntactic functions). In what follows we focus on a brief characterization of the TGTSs and on issues that are specific to the PDT scenario and crucial especially from the linguistic point of view. These issues concern (i) the transition from ATSs to TGTSs, (ii) the assignment of the features of the information structure of the sentence (topic-focus articulation), and (iii) a tentative treatment of coreference relations.
The TGTSs are based on dependency syntax; the tagging at this level is guided by the following principles: (a) a node of a TGTS represents an autosemantic (lexical) word; the correlates of synsemantic (functional, auxiliary) words are attached to the autosemantic words to which they belong; (b) where a word has been deleted in the surface shape of the sentence, further nodes are added to the TGTS to 'recover' it; (c) no non-projective structures are admitted in the TGTSs (they are supposed to be resolved by movement rules between the ATS and the TGTS); (d) not only the direction of the dependence on the governing node (to the left or to the right) is taken into account, but sister nodes are also ordered (from left to right).
1. Introductory remark
Thanks to the pioneering work of a small group of linguists, among whom Geoffrey Leech, with his exceptional theoretical involvement and fully competent initiative, is one of the most prominent figures, the linguistic elaboration of large corpora has become a major centre of interest. Its impact on future linguistic studies and applications (in lexicography, stylistics, literary studies and elsewhere) will keep growing, especially as work continues on tagging corpora for grammatical and other aspects. A large corpus, if syntactically annotated, opens up a quite new level of investigation, which can use the data gained by semi-automatic tagging procedures and make them more precise through monographic analyses. The existence of the large Czech National Corpus (initiated by F. Čermák) has allowed for the creation of the Prague Dependency Treebank (PDT), whose grammatical tagging scheme is based on the theoretical linguistic framework of the Functional Generative Description (see Sgall et al. 1986, Hajičová et al. 1998); we believe that a consistent linguistic basis has helped us to develop a complex scenario that covers both the core of language and many of the more or less frequent peripheral phenomena.
2. Morphemic and analytical tagging
The first phases of the tagging procedure (see Hajič 1998) consist of morphemic and "surface" annotations, during which the intermediate 'analytical level' is reached; the analytical tree structures (ATSs) contain a node for every word token, and even for every punctuation mark, as is often the case in tagging procedures. Before we come to a characterization of the ATSs, let us devote a few words to the morphemic level, at which each word-form and punctuation mark in the text is assigned the attributes 'word-form', 'lemma' and 'tag'. Tagging is manual, carried out with the full-screen programme sgd running under Linux (it can, however, also be operated remotely, e.g. from DOS). Both the input and the output data of sgd are in SGML format conforming to the csts DTD. As regards volume, the aim is to reach, in cooperation with FI MU Brno, no fewer than 1 million annotated word-forms.
The programme sgd requires a preliminary morphological treatment of the text, i.e., each word-form in it is supposed to be accompanied by a list of all its possible lemmas and their (possible) morphological categories. This assignment is done automatically on the basis of an electronic dictionary (at present the vocabulary covers some 98-99% of current newspaper or magazine texts, including names). The remaining word-forms are handled by manual tagging. Typing errors are registered and corrected. In addition to this manual POS tagging, a fully automatic procedure using stochastic modelling was designed and has been applied to the whole Czech National Corpus.
Due to the rich and complex inflectional morphemics of Czech (with seven morphemic cases and tens of paradigms of declension and conjugation), the number of tags is very high: the procedure works with almost 4000 combinations of morphological values, and its error rate is about 5%. New procedures are being developed to lower this rate (using combinations of different stochastic and rule-based methods); for more details see Hajič and Hladká (1997).
A certain approach to surface syntax has been specified in the ATSs, i.e. in tree structures whose nodes are each marked with 12 attributes (see Hajič 1998; Bémová et al. 1997); among them, the attribute 'afun' ('analytical functor') indicates the kind of dependency of the given node on its governing (head) node. For technical reasons, we work with an added root of the tree, on which the main verb of the sentence depends (with 'afun' "pred"), and we use special devices for coordination and apposition constructions, as well as for "distant" dependency (in certain cases in which a deleted head word occurs in the sentence structure), compound (improper) prepositions and conjunctions, parenthetic collocations, etc. An ATS contains nodes for all word-forms of the sentence, as well as for all punctuation symbols.
Foremost among the issues that present difficulties for a "surface-syntactic" analysis are those concerning the notion of Object. In the ATSs we do not distinguish between 'Direct', 'Indirect' and 'Second' Object, and we also label as Obj an infinitive dependent on a predicate. Only at the subsequent stage of tagging, in the TGTSs (see Section 3 below), are these syntactically different cases distinguished. With adverbials, too, only a single 'afun' "Adv" is used in the ATSs, a detailed classification being reserved for the tectogrammatical tagging.
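The shape of an ATS as described here can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions only: the class `AtsNode`, the helper `forms_with_afun`, the root label "Root" and the sample sentence Marie dnes čte knihu 'Mary is reading a book today' are hypothetical and are not taken from the PDT tools. Only two of the 12 node attributes are modelled, and punctuation nodes are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class AtsNode:
    """Toy analytical-tree node: a word form plus its afun."""
    form: str
    afun: str
    children: List["AtsNode"] = field(default_factory=list)

    def walk(self) -> Iterator["AtsNode"]:
        """Yield this node and all nodes below it, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def build_example() -> AtsNode:
    """Build a toy ATS for 'Marie dnes čte knihu' (hypothetical example)."""
    # Added technical root; the label "Root" is an assumption of this sketch.
    root = AtsNode(form="#", afun="Root")
    # The main verb depends on the added root.
    verb = AtsNode("čte", "Pred")
    root.children.append(verb)
    verb.children.append(AtsNode("Marie", "Sb"))
    verb.children.append(AtsNode("dnes", "Adv"))   # adverbials share one label
    verb.children.append(AtsNode("knihu", "Obj"))  # objects share one label
    return root


def forms_with_afun(root: AtsNode, afun: str) -> List[str]:
    """Collect the word forms of all nodes carrying the given afun."""
    return [node.form for node in root.walk() if node.afun == afun]


example_root = build_example()
objects = forms_with_afun(example_root, "Obj")
```

Since Obj is the only object label at this level, such a query cannot tell a direct object from an indirect one; in the scenario described here that distinction is left to the TGTSs.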
Similarly, the analytical representation of numerical expressions (in which problems of a specific function of the Genitive case are involved in Czech) does not correspond to the actual syntactic patterning (cf. sentences such as Pět žen tam už sedělo 'Five women were already sitting there', in which the verb form sedělo has Neuter gender, agreeing with the numeral pět rather than with the Feminine noun žen, which occurs here in the Genitive case, just as in e.g. vlastnosti žen 'features of women', where the Genitive clearly functions as an adjunct). The classification of function words is relatively detailed in the ATSs; the afuns comprise, e.g., Pred, Sb, Obj, Adv, Atv (Predicate Complement, e.g. in Našli ho spícího 'They found him asleep'), Atr (Adjunct dependent on a noun), Pnom (Predicate Nominal with copula), AuxV (auxiliary verb), Coord (Coordinating conjunction), AuxT (Reflexive particle with a 'reflexivum tantum' verb, e.g. divit se 'to wonder'), AuxR (Reflexive particle in a 'passive' (General Actor) construction, such as To se dá dobře pochopit 'One can easily understand this'), AuxP (a primary preposition or a part of a secondary preposition), AuxC (a subordinating conjunction), AuxX (a comma not serving as a coordinating conjunction). The annotators also have the option of indicating alternative analyses in certain cases where different sentence patterns are possible without a semantic difference, e.g. AtrAtr for an adjunct of any of several preceding nouns, AtrAdv for a structural ambiguity between adverbial and adnominal dependency, or AtrObj for an ambiguity between object and adnominal adjunct.
3. Dependency as the core of tectogrammatical syntax
3.1. Basic properties of tectogrammatics
A tectogrammatical sentence representation may differ from the corresponding ATS, since some nodes (those corresponding to function words and punctuation marks) can be eliminated and some added (representing items deleted in the outer form of the sentence, although present in its underlying structure). Up to now, a sample of about 1000 sentences has been tagged on this level in PDT. Dependency trees are present both in ATSs and on the level of tectogrammatical representations (TRs). However, the TRs contain only the nodes corresponding to lexical (autosemantic) units; function words (or, more exactly, their functions) are represented by indices on the lexical labels, i.e. by syntactic functors and by grammatemes (which mark values of tense, aspect, modalities, number, and of other grammatical categories). While in ATSs syntactic relations are classified without many subtle differences, such as those between types of objects or of adverbials, the tectogrammatical tree structures (TGTSs) are underlying structures (basically appropriate to serve as input to semantic interpretation, see Sgall et al. 1986; Sgall 1992) and distinguish roughly 40 or more kinds of syntactic relations (classified in the valency grids included in the lexical entries of the head words as arguments or adjuncts, and as obligatory or optional, see Panevová 1974; 1998; a detailed set of instructions for the transition from ATSs to TGTSs can be found in Hajičová et al. 2001). One significant aspect of the TGTSs is their topic-focus articulation with a scale of underlying word order; this aspect is discussed in Section 4 below. Let us just remark here, for the sake of illustration, that e.g. an adjective prototypically follows its head in a TGTS, even if it precedes it on the surface, i.e. in the word order of the morphemic representation (a string without parentheses), cf. malý 'small' in (1); see Sgall (1967), Hajičová (1984; 1993).
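The elimination of function-word nodes mentioned above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the transition procedure of Hajičová et al. (2001): the names `Node`, `collapse` and `FUNCTION_AFUNS` are hypothetical, the phrase žijí v Lomnici 'live in Lomnice' is analysed with the preposition governing the noun as an assumption of the sketch, and the removed word is merely recorded on the lexical node rather than converted into a functor or grammateme.

```python
from dataclasses import dataclass, field
from typing import List

# afuns treated as function-word labels in this sketch (an assumed selection).
FUNCTION_AFUNS = {"AuxP", "AuxC", "AuxV", "AuxT", "AuxR", "AuxX"}


@dataclass
class Node:
    form: str
    afun: str
    children: List["Node"] = field(default_factory=list)
    # Forms of removed function words attached to this lexical node.
    aux: List[str] = field(default_factory=list)


def collapse(node: Node) -> List[Node]:
    """Return the nodes that stand in for `node` once function words are removed.

    A lexical node is copied with its collapsed children. A function-word node
    disappears: its surviving children are promoted to its place and remember
    its form in `aux`. A function word without children is simply dropped.
    """
    promoted: List[Node] = []
    for child in node.children:
        promoted.extend(collapse(child))
    if node.afun in FUNCTION_AFUNS:
        for kept in promoted:
            kept.aux.insert(0, node.form)
        return promoted
    return [Node(node.form, node.afun, promoted, list(node.aux))]


# Assumed analytical shape: verb -> preposition (AuxP) -> noun (Adv).
ats = Node("žijí", "Pred", [Node("v", "AuxP", [Node("Lomnici", "Adv")])])
(skeleton,) = collapse(ats)
```

After the call, the noun hangs directly on the verb and carries the preposition in its `aux` list, which loosely mirrors the idea that function words survive only as indices on lexical nodes.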
For technical reasons, in tagging we use nodes for coordinating conjunctions (as heads of the coordinated items), although this does not exactly correspond to the theoretical specification of the tectogrammatical level (a formal treatment of which, covering all combinations of dependency and coordination and based on the detailed specification of the linguistic approach in Sgall et al. 1986, was presented by Petkevič 1995). Therefore we distinguish between tectogrammatical representations proper and Tectogrammatical Tree Structures (TGTSs), see Hajičová (1998); cf. Fig. 1, i.e. a (highly simplified) underlying tree for ex. (1).
(1) Marie a Jan, kteří mají malého syna, žijí v Lomnici.
'Mary and John, who have a small son, live in Lomnice.'